library(tidyverse); library(psych); library(knitr); library(kableExtra)
alc <- read.table("data/alc.csv", sep = ",", header = T)
The data is Student Performance Data Set from UCI Machine Learning Repository. It approaches student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). Full description of the original data can be found here.
The two datasets have been merged using several variables as identifiers to combine individual students information: school, sex, age, address, family size, parents’ cohabitation status, parents’ education and job, reason to choose school, attendance to school nursery, and home internet access. If the student had answered to the same question on both questionnaires, the the rounded average was calculated. If the question was non-numeric, the answer on Mathematics performance dataset was used.
The R script about creating the merged dataset can be found here.
## [1] X school sex age address famsize
## [7] Pstatus Medu Fedu Mjob Fjob reason
## [13] nursery internet guardian traveltime studytime failures
## [19] schoolsup famsup paid activities higher romantic
## [25] famrel freetime goout Dalc Walc health
## [31] absences G1 G2 G3 alc_use high_use
| Variables | Information |
|---|---|
| sex | ‘s sex (binary: ’F’ - female or ‘M’ - male) |
| Medu | mother’s education (numeric: 0 none, 1 primary education (4th grade), 2 5th to 9th grade, 3 secondary education or 4 higher education) |
| failures | number of past class failures (numeric: n if 1<=n<3, else 4) |
| absences | number of school absences (numeric: from 0 to 93) |
| high_use | high alcohol consumption (TRUE: the average of self-reported workday and weekend alcohol consumption greater than 2 on a scale 1 -very low - 5 very high, FALSE: the average 2 or lower) |
Let’s take a glimpse of the structure and dimensions of the subset of data we are using:
# subsetting
sub <- c("sex", "Medu", "failures", "absences", "G3")
glimpse(alc[,sub])
## Observations: 382
## Variables: 5
## $ sex <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, F,...
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3,...
## $ failures <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ absences <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3,...
## $ G3 <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12,...
As Medu is truly a categorical variable, not a numerical one, it will have to be recoded into a factor.
alc$Medu <- factor(alc$Medu)
levels(alc$Medu) <- c("None", "Primary", "5th to 9th", "Secondary", "Higher")
The purpose of my analysis is to study the relationships between high/low alcohol consumption and students’ demographic, social, and school-realted characetristics.
I chose following variables to explain the students’ alcohol consumption: student’s sex, father’s education, class failures, and absentees.
describe(alc[, c("sex", "Medu", "failures", "absences", "G3")]) %>% kable(digits = 3, caption = "<b>Table 2</b> Descriptives of chosen variables") %>% kable_styling(bootstrap_options = c("striped", "hover"))
| vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sex* | 1 | 382 | 1.482 | 0.500 | 1 | 1.477 | 0.000 | 1 | 2 | 1 | 0.073 | -2.000 | 0.026 |
| Medu* | 2 | 382 | 3.806 | 1.086 | 4 | 3.892 | 1.483 | 1 | 5 | 4 | -0.384 | -1.037 | 0.056 |
| failures | 3 | 382 | 0.202 | 0.583 | 0 | 0.033 | 0.000 | 0 | 3 | 3 | 3.034 | 8.689 | 0.030 |
| absences | 4 | 382 | 4.500 | 5.463 | 3 | 3.536 | 2.965 | 0 | 45 | 45 | 3.187 | 16.186 | 0.279 |
| G3 | 5 | 382 | 11.458 | 3.310 | 12 | 11.631 | 2.965 | 0 | 18 | 18 | -0.459 | 0.180 | 0.169 |
p1 <- ggplot(alc, aes(x = sex, fill = sex))
p1 + geom_bar() + theme(legend.position = "none") + ggtitle("Students' sex")+ ylab(" ") + xlab("Sex")
alc %>% count(sex) %>% kable(caption = "<b>Tale 3</b> Students' sex") %>% kable_styling(bootstrap_options = c("striped", "hover"), full_width = F, position = "left") %>% column_spec(2, width = "15em")
| sex | n |
|---|---|
| F | 198 |
| M | 184 |
p2 <- ggplot(alc, aes(x = high_use, fill = sex))
p2 + geom_bar() + theme(legend.position = "none") + facet_wrap(~sex) +
labs(title = "High alcohol consumption of female (left) and male (right) students",
caption = "TRUE = high consumption, FALSE = low consumption") +
ylab(" ") + xlab("High alcohol consumption")
alc %>%
group_by(high_use, sex)%>%
summarise(n=n())%>%
spread(sex, n) %>%
kable(caption = "<b>Table 3</b> High/low alcohol consumption by students' sex crosstabulated") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| high_use | F | M |
|---|---|---|
| FALSE | 156 | 112 |
| TRUE | 42 | 72 |
p3 <- ggplot(alc, aes(x = Medu))
p3 + geom_bar(fill = "deepskyblue2") + theme_classic() + ggtitle("Mother's education") + ylab(" ") + xlab("Education level of mother")
alc %>%
group_by(high_use, Medu)%>%
summarise(n=n())%>%
spread(Medu, n) %>%
kable(caption = "<b>Table 4</b> High/low alcohol consumption by mother's education crosstabulated") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| high_use | None | Primary | 5th to 9th | Secondary | Higher |
|---|---|---|---|---|---|
| FALSE | 1 | 33 | 80 | 59 | 95 |
| TRUE | 2 | 18 | 18 | 36 | 40 |
p3 <- ggplot(alc, aes(x = failures))
p3 + geom_bar(fill = "deepskyblue2") + theme_classic() + ggtitle("Class failures") + ylab(" ") + xlab("Number of class failures")
alc %>%
group_by(high_use, failures)%>%
summarise(n=n())%>%
spread(failures, n) %>%
kable(caption = "<b>Table 5</b> High/low alcohol consumption by mother's education crosstabulated") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| high_use | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| FALSE | 244 | 12 | 10 | 2 |
| TRUE | 90 | 12 | 9 | 3 |
p4 <- ggplot(alc, aes(x = absences))
p4 + geom_bar(fill = "deepskyblue2") + theme_classic() + ggtitle("Absences") + ylab(" ") + xlab("Number of absences")
p5 <- ggplot(alc, aes(x = high_use, y = absences, fill = sex))
p5 + geom_boxplot() + ggtitle("Boxplots of absences vs. high/low alcohol use by gender") + ylab("Number of absences") + xlab("High alcohol consumption")
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_absences = mean(absences)) %>% kable(digits = 2, caption = "<b>Table 5</b> Mean number absences by high/low alcohol consumption and sex") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| sex | high_use | count | mean_absences |
|---|---|---|---|
| F | FALSE | 156 | 4.22 |
| F | TRUE | 42 | 6.79 |
| M | FALSE | 112 | 2.98 |
| M | TRUE | 72 | 6.12 |
p6 <- ggplot(alc, aes(x = G3))
p6 + geom_histogram(color = "white", fill = "deepskyblue2", bins=18) + theme_classic() + ggtitle("Grade") + ylab(" ") + xlab("Grade")
p7 <- ggplot(alc, aes(x = high_use, y = G3, fill = sex))
p7 + geom_boxplot() + ggtitle("Boxplots of grades vs. high/low alcohol use by gender") + ylab("Grade") + xlab("High alcohol consumption")
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_grade = mean(G3)) %>% kable(digits = 3, caption = "<b>Table 6</b> Mean grade by high/low alcohol consumption and sex") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| sex | high_use | count | mean_grade |
|---|---|---|---|
| F | FALSE | 156 | 11.397 |
| F | TRUE | 42 | 11.714 |
| M | FALSE | 112 | 12.205 |
| M | TRUE | 72 | 10.278 |
# Fitting the logistic regression
mod1 <- glm(high_use ~ Medu + failures + absences + sex, data = alc, family = "binomial")
# Summary of the model
summary(mod1)
##
## Call:
## glm(formula = high_use ~ Medu + failures + absences + sex, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3466 -0.8419 -0.6057 1.0340 2.3030
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.07693 1.26355 0.061 0.9514
## MeduPrimary -1.67979 1.29798 -1.294 0.1956
## Medu5th to 9th -2.65576 1.29394 -2.052 0.0401 *
## MeduSecondary -1.72156 1.28764 -1.337 0.1812
## MeduHigher -2.00750 1.28668 -1.560 0.1187
## failures 0.43321 0.20127 2.152 0.0314 *
## absences 0.09627 0.02427 3.967 7.28e-05 ***
## sexM 0.97945 0.24932 3.928 8.55e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 412.83 on 374 degrees of freedom
## AIC: 428.83
##
## Number of Fisher Scoring iterations: 4
As Medu does not consistently predict high alcohol consumption well (the only significant coefficient being class “5th to 9th” [p =.040], I omitted the variable from the model. This also makes the model easier to interpret.
# Fittiing Model 2
mod2 <- glm(high_use ~ failures + absences + sex, data = alc, family = "binomial")
# Summary of Model 2
summary(mod2)
##
## Call:
## glm(formula = high_use ~ failures + absences + sex, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1855 -0.8371 -0.6000 1.1020 2.0209
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.90297 0.22626 -8.411 < 2e-16 ***
## failures 0.45082 0.18992 2.374 0.017611 *
## absences 0.09322 0.02295 4.063 4.85e-05 ***
## sexM 0.94117 0.24200 3.889 0.000101 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 424.40 on 378 degrees of freedom
## AIC: 432.4
##
## Number of Fisher Scoring iterations: 4
# Computing odds ratios and confidence intervals
OR <- coef(mod2) %>% exp
CI <- confint(mod2) %>% exp
## Waiting for profiling to be done...
# Printing out the odds ratios with their confidence intervals
cbind(OR, CI) %>% kable(digits = 3, caption = "<b>Table 7</b> Odds ratios and their confidence intervals") %>% kable_styling(bootstrap_options = c("striped", "hover"))
| OR | 2.5 % | 97.5 % | |
|---|---|---|---|
| (Intercept) | 0.149 | 0.094 | 0.229 |
| failures | 1.570 | 1.083 | 2.295 |
| absences | 1.098 | 1.052 | 1.151 |
| sexM | 2.563 | 1.604 | 4.149 |
As we can see from the model summary, the intercept of high alcohol consumption is -1.90, which is more than 8 standard deviations (z value or the Wald’s Test value) away from 0 on the standard normal curve with a statistically significant p < .001. The slope coefficient of, for example, absences is 0.093. This means that for one point increase in absences the log of the odds of high alcohol consumption increases 0.09. The z values of coefficients of failures, absences, and sex ar positive and over 2 standard deviations away from 0 and are statistically significant with p < .05 for failures and p < .001 for absences and sex.
From the odds ratios we we can see that when the effect of the other predictor variables are taken into account…
Conclusion: As expected, class failures, school absences, and student’s sex predict higher alcohol consumption. Male students are more likely be high alcohol consumers. Class failures and absentees also increase the probability of higher consumption. However, mother’s education doesn’t seem to predict alcohol use consistently.
# predict() the probability of high_use
probabilities <- predict(mod1, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
# see the last ten original classes, predicted probabilities, and class predictions
select(alc, Medu, failures, absences, sex, high_use, probability, prediction) %>% tail(10)
## Medu failures absences sex high_use probability prediction
## 373 Primary 1 0 M FALSE 0.45259271 FALSE
## 374 Higher 1 7 M TRUE 0.53891660 TRUE
## 375 5th to 9th 0 1 F FALSE 0.07708996 FALSE
## 376 Higher 0 6 F FALSE 0.20538904 FALSE
## 377 5th to 9th 1 2 F FALSE 0.12421781 FALSE
## 378 Secondary 0 2 F FALSE 0.18968092 FALSE
## 379 Primary 2 2 F FALSE 0.36728009 FALSE
## 380 Primary 0 3 F FALSE 0.21181023 FALSE
## 381 Secondary 0 4 M TRUE 0.43043082 FALSE
## 382 Secondary 0 2 M TRUE 0.38399298 FALSE
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 257 11
## TRUE 78 36
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
# define the geom as points and draw the plot
g + geom_point()
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.67277487 0.02879581 0.70157068
## TRUE 0.20418848 0.09424084 0.29842932
## Sum 0.87696335 0.12303665 1.00000000
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2329843
The average number of frong predictions in the data is 23% using student’s sex, absences, and class failures as predictors. This means that the prediction was right about three times out of four. This is significantly better than by just guessing (error rate of 50%). The model was especially accurate at predicting low alcohol consumption by preidcting correctly 257 times out of 268. However, the model predicted wrong most of the cases where the alcohol consumption was categorized high.